---
title: "Sev-2 Handoff: API Gateway Incident 2026-04-13"
date: 2026-04-13
tags:
  - incident
  - sev-2
  - handoff
  - api-gateway
severity: sev-2
status: active
incident-commander: TBD
---

# Sev-2 Handoff — API Gateway Incident

**Date**: 2026-04-13
**Severity**: Sev-2 (per [[On-Call Runbook]] escalation policy — ack within 30 min, resolve within 4 hrs)
**System affected**: [[API Gateway]] (`https://api.acme.dev`, Kong 3.4 / AWS NLB, `us-east-1`)
**PagerDuty policy**: `api-gateway-critical`

---

## Summary

This morning the [[API Gateway]] experienced degraded performance impacting external API traffic. The incident was detected via PagerDuty alert on the `api-gateway-critical` escalation policy. Investigation and mitigation steps were carried out following [[On-Call Runbook#API Gateway Restarts]]. This note hands off current state to the team ahead of the retro.

%%Fill in specific details: error type, trigger, affected routes%%

> [!info] Prior Related Incident
> Review [[Postmortem 2026-03-28]] (connection pool exhaustion during traffic spike) for potentially related context. Check whether the same failure mode contributed here.

---

## Timeline

%%Fill in actual times from PagerDuty and Slack #incident-room%%

| Time (UTC) | Event                                      |
| ---------- | ------------------------------------------ |
| HH:MM      | First alert triggered (PagerDuty)          |
| HH:MM      | On-call acknowledged                       |
| HH:MM      | Joined `#incident-room`, triage started    |
| HH:MM      | Initial diagnosis — root cause suspected   |
| HH:MM      | Mitigation applied                         |
| HH:MM      | Health check passing (`/healthz` returning 200) |
| HH:MM      | Incident resolved / monitoring             |

---

## Current Status

- **Service health**: %%Healthy / Degraded / Down%%
- **Monitoring**: Datadog dashboard — [Kong Datadog Board](https://app.datadoghq.com/dashboard/acme-kong) (`env:production`, `service:api-gateway`)
- **SLO impact**: Target 99.95% availability / p99 < 200ms — %%calculate budget consumed%%

---

## Impact

- **Duration**: %%X hours Y minutes%%
- **Routes affected**: %%e.g. `/v1/orders/*`, `/v1/payments/*` — see [[API Gateway#Key Routes]]%%
- **Users affected**: %%N users / N% of traffic%%
- **Revenue impact**: %%$X if applicable%%
- **SLO budget consumed**: %%X%%%

---

## Actions Taken

1. Acknowledged PagerDuty alert and joined `#incident-room`
2. Checked pod status: `kubectl get pods -n api-gateway`
3. Captured logs before any restarts: `kubectl logs -n api-gateway -l app=kong --tail=200`
4. %%Describe mitigation — e.g. scaled replicas, rolled back config%%
5. Verified health: `curl -s https://api.acme.dev/healthz`

%%Add additional steps taken during response%%

---

> [!warning] Open Risks
> - %%Root cause may not be fully resolved — describe any lingering concerns%%
> - Check whether connection pool exhaustion pattern from [[Postmortem 2026-03-28]] is recurring
> - Confirm Datadog alerts are back to baseline (`env:production`, `service:api-gateway`)
> - %%Any config rollbacks that are temporary and need a permanent fix%%

---

> [!todo] Next Steps Before Retro
> - [ ] Complete this timeline with exact timestamps from PagerDuty and Slack
> - [ ] Copy [[Postmortem Template]] and begin drafting — ==48-hour filing deadline (due 2026-04-15)==
> - [ ] Quantify SLO budget consumed and user impact
> - [ ] Verify no related alerts are still firing on the [Kong Datadog Board](https://app.datadoghq.com/dashboard/acme-kong)
> - [ ] Review [[Postmortem 2026-03-28]] action items — confirm prior remediations are still in place
> - [ ] Prepare retro talking points: what went well, what went wrong, action items

---

## Contacts

| Role          | Name      | Slack        |
| ------------- | --------- | ------------ |
| SRE Lead      | Dana Kim  | @dana.kim    |
| Platform Lead | Alex Chen | @alex.chen   |

---

## References

- [[API Gateway]] — system architecture, routes, and monitoring
- [[On-Call Runbook]] — escalation policy and [[On-Call Runbook#API Gateway Restarts|restart procedures]]
- [[Postmortem Template]] — copy for the formal writeup
- [[Postmortem 2026-03-28]] — prior connection pool exhaustion incident
- [[Severity Matrix]] — severity classification reference
